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Abstract 

We consider solving the £i-regularized least-squares (£i-LS) problem in the context of sparse 

recovery, for applications such as compressed sensing. The standard proximal gradient method, 

also known as iterative soft-thresholding when applied to this problem, has low computational 

^^ , cost per iteration but a rather slow convergence rate. Nevertheless, when the solution is sparse, it 

r^ ' often exhibits fast linear convergence in the final stage. We exploit the local linear convergence 

using a homotopy continuation strategy, i.e., we solve the ^i-LS problem for a sequence of 

decreasing values of the regularization parameter, and use an approximate solution at the end 

2 . of each stage to warm start the next stage. Although similar strategies have been studied in the 

literature, there have been no theoretical analysis of their global iteration complexity. This paper 

shows that under suitable assumptions for sparse recovery, the proposed homotopy strategy 

ensures that all iterates along the homotopy solution path are sparse. Therefore the objective 

^ , function is effectively strongly convex along the solution path, and geometric convergence at 

CsJ ■ each stage can be established. As a result, the overall iteration complexity of our method is 

^3? I 0(log(l/e)) for finding an e-optimal solution, which can be interpreted as global geometric rate 

of convergence. We also present empirical results to support our theoretical analysis. 
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'^ ' 1 Introduction 

^^ ' 

In this paper, we propose and analyze an efficient numerical method for solving the ii-regularized 
least-squares (^i-LS) problem 

><; 1 

C^ , minimize — ll^lx — ftlU + A||x||i, (1) 

- - - X 2 

where x G M" is the vector of unknowns, A G ]^"«x" and b € M™ are the problem data, and A > is 
a regularization parameter. Here || • ||2 denotes the standard Euclidean norm, and ||x||i = ^^ \xi\ 
is the ii norm of x. This is a convex optimization problem, and we use x*{X) to denote its (global) 
optimal solution. Since the ii term promotes sparse solutions, we also refer problem ([T]) as the 
sparse least-squares problem. 

The ii-hS problem has important applications in machine learning, signal processing, and 
statistics; see, e.g., |Tib961 [CDS981 [BDE09| . It received revived interests in recent years due to 
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the emergence of compressed sensing theory, which builds upon the fundamental idea that a finite- 
dimensional signal having a sparse or compressible representation can be recovered from a small 
set of linear, nonadaptive measurements |CRT061 ICT061 IDon06j . We are especially interested in 
solving the ^i-LS problem in such a context, with the goal of recovering a sparse vector under 
measurement noise. More precisely, we assume A and b in ([1]) are related by a linear model 

b = Ax + z, 

where x is the sparse vector we would like to recover in statistical applications, and z is a noise 
vector. We assume that the noise level, measured by ||A 2;||oo, is relatively small compared with the 
regularization parameter A. This scenario is of great modern interest, and various properties of the 
solution x*{X) have been investigated [C^TOSl IDETOGl IMBOGl ITroOBl [ZYnHj K^TO?! \ZHM\ IZhaOQl 
IBRT091 IKol091 IvdGBOOl IWai09j . In particular, it is known that under suitable conditions on A 
such as the restricted isometry property (RIP), and as long as A > c||^"^2;||oo (for some universal 
constant c), one can obtain a recovery bound of the optimal form 

\\x*{X)-x\\l = 0{\'\\x\\o), (2) 

where ||x||o denotes the number of nonzero elements in x. The constant in O(-) depends only on 
the so-called RIP condition that we will discuss later on, and this bound achieves the optimal order 
of recovery. Moreover, it is known that in this situation, the solution x*(A) is sparse |ZH08] . and 
the sparsity of the solution is closely related to the recovery performance. 

In this paper, we develop an efficient numerical method for solving the ^i-LS problem in the 
context of sparse recovery described above. In particular, we focus on the case when m < n (i.e., 
the linear system Ax = b is under deter mined) and the solution re* (A) is sparse (which requires 
the parameter A to be sufficiently large). Under such assumptions, our method has provable lower 
complexity than previous algorithms. 

The ^i-LS problem ([T]) is closely related to the following two constrained convex optimization 
problems: 

minimize ||Ax — 6II2 subject to ||x||i < A, (3) 

X 

known as the least absolute shrinkage and selection operator (LASSO) |Tib96j . and 

minimize \\x\\i subject to \\Ax — b\\2 < e, (4) 

X 

where A and e are two nonnegative real parameters. These problems have the same solution as ([TJ 
for appropriate choices of the parameters A, A and e. However, other than in some special cases, the 
exact correspondence between these parameters are not known a priori. Therefore, algorithms that 
are specific for solving one formulation may not be used directly for solving others. Nevertheless, 
our method can be adapted to solve ^ and Q efficiently, either by using an augmented Lagrangian 
approach [YOGDOS] , or by using a root-finding procedure similar as the one given in |vdBF08] . 

1.1 Previous algorithms 

There have been extensive research on numerical methods for solving the problems ([1]) , ([3]) and (JH . 
A nice survey of major practical algorithms for sparse approximation appeared in |TW10j . and 
performance comparisons of various algorithms can be found in, e.g., |WNF09] IWYGZlOl IBBCll] , 



Here we briefly summarize the computational complexities of several methods that are most relevant 
for solving the £i-LS problem ([1]), in terms of finding an e-optimal solution (i.e., obtaining an 
objective value within e of the global minimum). 

Interior-point methods were among the first approaches used for solving the ^i-LS problem 
ICDS981 ITVW051 |KKL+07| . The theoretical bound on their iteration complexity is O (V^log(l/e)), 
although their practical performance demonstrate much weaker dependence on n. The bottleneck 
of their performance is the computational cost per iteration. For example, with an unstructured 
dense matrix A, the standard approach of solving the normal equation in each iteration with a direct 
method (Cholesky factorization) would cost Oijn^n) fiops, which is prohibitive for large-scale appli- 
cations. Therefore all customized solvers |CDS98l ITVWOSl lKKL"'"07j use iterative methods (such 
as conjugate gradients) for solving the linear equations. These methods only require matrix-vector 
multiplications involving A and A?" , and the computational cost per iteration can be 0{mn). The 
cost can be further reduced if the matrix-vector multiplication can be conducted more efficiently, 
e.g., 0{n\ogn) if A is a partial Fourier matrix. 

Proximal gradient methods for solving the £i-LS problem take the following basic form at each 
iteration fe = 0, 1, . . . 



argmin|/(xW) + V/(x(^))^(y-xW) + ^||2/-x(^)||2 + A||y||i| 



where we used the shorthand f{x) = (l/2)||^x — ^H^, and L^ is a parameter chosen at each iteration 
(e.g., using a line-search procedure). The minimization problem in ([5]) has a closed- form solution 

x^'^+i) = soft fx^'^) - -^V/(x('^)) , A") , (6) 

where soft : M" x M"*" — > M" is the well-known soft-thresholding operator, defined as 

(soft(x, a))j = sgn(xj) max{|xj| — a, 0} , i = 1, . . . ,n. (7) 

Iterative methods that use the update rule © include |DDMn41 [CWOHl INesOTI IHYZOSI IWNFOQj . 
Their major computational effort per iteration is to form the gradient V/(x) = A (Ax — b), which 
costs 0{mn) flops for a generic dense matrix A. With appropriate choices of the parameters L^, 
the proximal-gradient method ^ has an iteration complexity 0(l/e). 

Indeed, the iteration complexity 0(log(l/e)) can be established for ([5]) if m > n and A has full 
column rank, since in this case the objective function in ([T|) is strongly convex |Nes07j . Unfortu- 
nately this result is not applicable to the case m < n. Nevertheless, when the solution x*(A) is 
sparse and the active submatrix is well conditioned (e.g., when A has RIP), local linear convergence 
can be established |LT92^ IHYZ08| , and fast convergence in the flnal stage of the algorithm has also 
been observed |Nes07l IHYZ08llWNF09] . 

Variations and extensions of the proximal gradient method have been proposed to speed up 
the convergence in practice; see, e.g., |BDF07t IWNFODJ IWYGZIO] . Nesterov's optimal gradient 
methods for minimizing smooth convex functions |Nes831 INes041 INes05j have also been extended to 
minimize composite objective functions such as in the ii-LS problem |Nes07l [Tse08l IBTOD^ IBBC 11 j . 
These accelerated methods have the iteration complexity 0{l/^/e). They typically generate two or 
three concurrent sequences of iterates, but their computational cost per iteration is still 0{mn), 
which is the same as simple gradient methods. 



Exact homotopy path-following methods were developed in the statistics literature to compute 
the complete LASSO path when varying the regularization parameter A from large to small 
[OPTOOal lOPTOObl IEHJT04] . These methods exploit the piece- wise linearity of the solution as 
a function of A, and identify the next breakpoint along the solution path by examining the optimal- 
ity conditions (also called active set or pivoting method in optimization). With efficient numerical 
implementations (using updating or downdating of submatrix factorizations), the computational 
cost at each break point is 0{mn + ms'^), where s is the number of nonzeros in the solution at 
the breakpoint. Such methods can be quite efficient if s is small. However, in general, there is no 
convergence result bounding the number of breakpoints for this class of methods (for some special 
cases, the number of breakpoints is the same as the number of nonzeros in the solution |DT08] ). 

Greedy algorithms such as orthogonal matching pursuit (OMP) are also very popular for sparse 
recovery applications (e.g., JDMA971 ITro041 lNT09j ). However, they are not designed to solve any 
of the optimization problems ([I]) , ([3]) or (jl]) . Their connections with exact homotopy methods are 
analyzed in |DT08| . 

1.2 Proposed approach and contributions 

We consider an approximate homotopy continuation method, where the key idea is to solve ([1]) 
with a large regularization parameter A first, and then gradually decreases A until the target 
regularization is reached. For each fixed A, we employ a proximal gradient method of the form ([5]) 
to solve ([T]) up to an adequate precision (to be specified later), and then use this approximate 
solution to serve as the initial point for the next value of A. We call the resulting method proximal- 
gradient homotopy (PGH) method. 

This is not a new idea. Approximate homotopy continuation methods that use proximal gradient 
methods for solving each stage (with a fixed value of A) have been studied in, e.g., |HYZ081 IWNF091 
IWYGZiO] . and superior empirical performance have been reported when the solution is sparse. 
However, there has been no effective theoretical analysis for their overall iteration complexity. As a 
result, some important algorithmic choices are mostly based on heuristics and ad hoc factors. More 
specifically, how do we choose the sequence of decreasing values for A? and how accurate should 
we solve the problem ([1]) for each value in this sequence? 

In this paper, we present a PGH method that has provable low iteration complexity, along with 
the following specific algorithmic choices: 

• We use a decreasing geometric sequence for the values of A. That is, we choose a Aq and a 
parameter rj E (0, 1), and let \k = V^ ^o for K = 1,2, . . . until the target value is reached. 

• We choose a parameter 6 £ (0, 1) and solve problem ([T|) for each Xk with a proportional 
precision 5Xk (in terms of violating the optimality condition) , except that for the final target 
value of A, we reach the absolute precision e. 

• We use Nesterov's adaptive line-search strategy in |Nes07| to choose the parameters L/j in the 
proximal gradient method ([5]). 

Under the assumptions that the target value of A is sufficiently large (such that the final solution 
is sparse) and the matrix A satisfies a RIP-like condition, our PGH method exhibits geometric 
convergence at each stage, and the overall iteration complexity is 0(log(l/e)). The constant in 
O(-) depends on the RIP-like condition. Moreover, it is sufficient to choose A > c||A 2;||oo (for some 



universal constant c), which imphes that the solution satisfies a recovery bound of the optimal 
form ([2]). Since each iteration of the proximal gradient method cost 0{mn) flops, the overall 
computational complexity is 0(m?ilog(l/e)), implying global geometric rate of convergence. 

The low iteration complexity of our PGH method is achieved by actively exploiting the fast 
local linear convergence of the standard proximal gradient method when the solution x*(A) is sparse 
|LT92t IHYZ08J . Using the homotopy continuation strategy, the proximal gradient method at each 
stage always starts with a point that is close to its solution. Moreover, by choosing appropriate 
parameters rj and 6 in our method, we ensure that all iterates along the solution path (i.e., not 
only the final points) at each stage are sufficiently sparse. Under a RIP-like assumption on A, this 
implies that along the homotopy path, the objective function in ([T]) is effectively strongly convex, 
and hence global geometric rate can be established using Nesterov's analysis |Nes07] . 

The advantage of our method over the exact homotopy path- following approach ( [OPTOOal 
lOPTOObl IEHJT04] ) is that there is no need to keep track of all breakpoints. In fact, for large-scale 
problems, the total number of proximal gradient steps in our method can be much smaller than 
the number of nonzeros in the target solution, which is the minimum number of breakpoints the 
exact homotopy methods have to compute. This phenomenon is predicted by our low iteration 
complexity, and also confirmed in our empirical studies. 

Compared with interior-point methods (IPMs), our methods has a similar iteration complexity 
(actually better in terms of theoretical bounds), and computationally can be much more efficient 
for each iteration. The approximate homotopy strategy used in this paper is also analogous to the 
long-step path- following IPMs (e.g., |Nes96] ). in the sense that the least-squares problem becomes 
better conditioned near the regularization path (cf. central path in IPMs). However, our results 
only hold for problems with provable sparse solutions, and the parameters rj and 5 depends on the 
problem data A and the regularization parameter A. In contrast, the performance of interior-point 
methods is insensitive to the sparsity of the solution or the regularization parameter. 

As an important special case, our results can be immediately applied to noise-free compressed- 
sensing applications. Consider the basis pursuit (BP) problem 

minimize \\x\\i subject to Ax = b, (8) 

which is a special case of ^ with e = 0. Its solution can be obtained by running our PGH method 
on the ii-LS problem ([1]) with A — t- 0. In terms of satisfying the condition A > c||Az||oo, any 
A > is sufficiently large in the noise-free case because z = 0. Therefore, the global geometric 
convergence of the PGH method for BP is just a special case of the more general result for ([1]) 
developed in this paper. 

It is also worth mentioning that variants of the proximal gradient method ([5]) can be directly 
applied to the constrained LASSO formulation ([3|). Moreover, under suitable conditions and when 
the parameter A is set to nearly equal to ||x||i, geometric convergence away from the optimal 
solution can be established jANWl l] . However, for sparse recovery applications, such a result is 
less satisfactory than the homotopy approach we analyze in this paper due to the requirement 
of estimating ||x||i — which is extremely difficult to determine efficiently in practice even for the 
simple noise- free case of basis pursuit. The proof techniques are also different, and the analysis 
of geometric convergence for PGH is more difficult than that of |ANW1 1"| . because we have to 
demonstrate sparsity of all the intermediate solutions in the proximal gradient steps along the 
homotopy path. A significantly simpler argument can be used in [ANWll] , if the extra knowledge 
of II sill is known a priori. 



1.3 Outline of the paper 

In Section [21 we review some preliminaries that are necessary for developing our method and its 
convergence analysis. In Section [3l we present our proximal-gradient homotopy (PGH) method, 
and state the assumptions and the main convergence results. Section U] is devoted to the proofs of 
our convergence results. We present numerical experiments in Section [5] to support our theoretical 
analysis, and conclude in Section [6] with some further discussions. 

2 Preliminaries and notations 

In this section, we first introduce composite gradient mapping and some of its key properties 
developed in |Nes07j . Then we describe Nesterov's proximal gradient method with adaptive line 
search, which we will use to solve the ii-LS problem at each stage of our PGH method. Finally 
we discuss the restricted eigenvalue conditions that allow us to show the local linear convergence 
of Nesterov's algorithm. 

2.1 Composite gradient mapping 

Consider the following optimization problem with composite objective function: 

{0(x)^/(x) + M/(x)}, (9) 



minimize 

X 



where the function / is convex and differentiable, and ^ is closed and convex on M"". The optimality 
condition of ([9]) states that x* is a solution if and only if there exists ^ G d^{x*) such that 

(see, e.g., |Roc701 Section 27]). Therefore, a good measure of accuracy for any x as an approximate 
solution is the quantity 

a;(x)^ min ||V/(x)+^||oo. (10) 

We call uj{x) the optimality residue of x. We will use it in the stopping criterion of the proximal 
gradient method. 

Composite gradient mapping was introduced by Nesterov in |Nes07] . For any fixed point y and 
a given constant L > 0, we define a local model of (j){x) around y using a quadratic approximation 
of / but keeping ^ intact: 

V'id/; x) = f{y) + W{yf{x -y) + ^\\x- yWl + ^(a^)- 

Let 

Tiiy) = argmin ipL{y]x). (11) 

X 

Then the com,posite gradient mapping of f at y is defined as 

gUy) = L{y-TL{y)). 

In the case ^{x) = 0, it is easy to verify that giiy) = ^f{y) for any L > 0, and 1/L can be 
considered as the step-size from y to T^^y) along the direction —giiy)- The following property of 
composite gradient mapping was shown in |Nes071 Theorem 2]: 



Lemma 1. For any L > 0, 

My;TL{y)) < Hy) - ^\\9L{y)\\l 

The function / has Lipschitz continuous gradient if there exists a constant Lf such that 

||V/(rE) - V/(y)||2 < Lf\\x - yh, Vx,?/ G M^ 

A direct consequence of having Lipschitz continuous gradient is the fohowing inequahty (see, e.g., 
|Nes04l Theorem 2.1.5]): 

fiy)<f{x) + {Vf{x),y-x) + ^\\y-x\\l Vx,yGM". (12) 

For such functions, we can measure how close Tid/) is from satisfying the optimahty condition by 
using the norm of the composite gradient mapping at y. 

Lemma 2. If f has Lipschitz continuous gradients with Lipschitz constant Lf, then 

coiTUy)) < (i + ^) \\9L{y)\\2 ^ (i + ^) \\9L{y)h 

where ^^(y) is a local Lipschitz constant defined as 

\\Vf{TL{y))-Vf{y)h 
\\TL{y) -yh 

Proof. Let D(f){x)[u] denote the directional derivative of (/> at x along the direction u, i.e, 

D(j){x) [u] = Mm — (cl){x + au) — (/>(x)) . 
aio a 

Corollary 1 in |Nes07j states that for any u G M" with ||n||2 = 1, the following inequality holds: 

DHTL{y))[u]>-(^l + ^y\gL{y)h. 



In addition, it is shown in [NesOTj that for any x £ M", 



min ||V/(x) + .^||2 = — min Dcj){x)[u]. 
Ced^ix) ll«l|2=l 



(See |Nes07l Section 2].) Therefore, we have 



u:{TUy)) < ^ inin \\Vf{n{y)) + ^Ib < f 1 + ^) 



\9L 



The last desired inequality follows from the fact SL{y) < Lj. D 



Algorithm 1: {x~^,M} ^ LineSearch(A,a;,L) 



input: A > 0, X G M", L > 
parameter: 7inc > 1 
repeat 

x^ ^ Tx,l{x) 

if (px{x^) > i1^x,l{x;x+) then L ^ Ly,, 
until <j)\{x^) <= ipx,Lix;x+) 
M ^ L 
return {x~^,M} 



Algorithm 2: {x,M} ^ ProxGrad(A, e,x(°\ Lq) 



input: A > 0, e > 0, x(°) G M", Lq > L^in 
parameters: L„im > 0, 7dec > 1 
repeat for A; = 0, 1, 2, . . . 

{^(fc+i)^^^| ^ LineSearch(A,xW,Lfc) 

Lfe+i ^ max{Lmin,Mfc/7dec} 

until a;A(a;(''+^)) < e 

M ^ Mk 
return {x, Af} 



In this paper, we use the following notations to simplify presentation: 



1 

T 
Mx) = /(x) + A||x||i. 



fix) = -\\Ax-bg 



Correspondingly, we add the subscript A in specifying the composite gradient mapping: 

ipxAv'^^) = fiy) + ^fiyVix-y) + ^\\x-y\\l + x\\x\\i 

T\x{y) = argmin ipx,L{y;x) 

X 

9x,L{y) = L{y-Txx{y)) 

ujx{x) = min ||V/(x) + A^Hoo- 
^eSllxlii 

We call the process of computing T^^y) a proximal gradient step. For the ^i-LS problem, Tx^l{x) 
has the closed-form solution given in ([6|). Given the gradient V/(x), the optimality residue 0Jx{x) 
can be easily computed with 0{n) flops. 

2.2 Nesterov's gradient method with adaptive hne-search 

With the machinery of composite gradient mapping, Nesterov developed several variants of proximal 
gradient methods in |Nes07j . We use the non-accelerated primal-gradient version described in 



Algorithms [T] and [21 which correspond to (3.1) and (3.2) in |Nes07j . respectively. To use this 
algorithm, we need to first choose an initial optimistic estimate L^[^ for the Lipschitz constant Lf. 

< -Lmin < Lf, 

and two adjustment parameters 7dcc ^ 1 and 7inc > 1- A key feature of this algorithm is the 
adaptive line search: it always tries to use a smaller Lipschitz constant first at each iteration. 
Each iteration of the proximal gradient method generates the next iterate in the form of 

x('=+^) = r,,M,(x('=)), 

where Mk is chosen by the line search procedure in Algorithm ([T]). The line search procedure starts 
with an estimated Lipschitz constant Lk, and increases its value by the factor 7inc until the stopping 
criteria is satisfied. The stopping criteria for line search ensures 

where the last inequality follows from Lenmia[TJ Therefore, we have the objective value 4)\{x^ ') 
decrease monotonically with k, unless the gradient mapping g\Mf.{x^ ') = 0. In the latter case, 
according to Lemma [21 x^^~^^> is an optimal solution. 

The only difference between Algorithm [2] and Nesterov's gradient method |Nes07l (3.2)] is that 
Algorithm [21 has an explicit stopping criterion. This stopping criterion is based on the optimality 
residue uix{x^ ^^') being small. For the ^i-LS problem, it can be computed with additional 0{n) 
flops given the gradient V/(x). For other problems, depending on the form of ^, this residue may 
be hard to compute. But we can always use the alternative stopping criterion 

\\9\mM^'^)\\2 <e. 

According to Lemma [21 these two measures may differ by a factor (1 + SMh{x^^~^^>)/Mk). So the 
precision e may need to be reduced by a similar factor. 

Since / has Lipschitz constant Lf, the inequality (J12p implies that the line search procedure is 
guaranteed to terminate if L > Lf. Therefore, we have 

Lrnin < Lk < Mfc < TmcLf. (14) 

Although there is no explicit bound on the number of repetitions in the line search procedure, 
Nesterov showed that the total number of line searches cannot be too big. More specifically, let A'^^ 
be the number of operations x"^ ^ Tx^l{x) after k iterations in Algorithm [2l Lemma 3 in |Nes07] 
showed that 

A^. < fl + i^^V^ + l) + ^max{ln^^,0|. 

V ln7inc/ ln7inc I 7decimin J 

For example, if we choose 7inc = 7dcc = 2, then 

Afe<2(fc + l) + log2-^. (15) 

Nesterov established the following iteration complexities of Algorithm [2] for finding an e-optimal 
solution of the problem ([9|): 

9 



• If (f)\ is convex but not strongly convex, then the convergence is subHnear, with an iteration 
complexity 0(l/e) |Nes07t Theorem 4]; 

• If (j)\ is strongly convex, then the convergence is geometric, with an iteration complexity 
0(log(l/e)) |Nes07[ Theorem 5]. 

A nice property of this algorithm is that we do not need to know a priori if the objective function 
is strongly convex or not. It will automatically exploit the strong convexity whenever it holds. The 
algorithm is the same for both cases. 

For our interested case m < n, the objective function in Problem ([1]) is not strongly convex. 
Therefore, if we directly use Algorithm [2] to solve this problem, we can only get the 0(l/e) iteration 
complexity (even though fast local linear convergence was observed in |Nes07j when the solution 
is sparse). Nevertheless, as explained in the introduction, we can use a homotopy continuation 
strategy to enforce that all iterates along the solution path are sufficiently sparse. Under a RIP-like 
assumption on A, this implies that the objective function is effectively strongly convex along the 
homotopy path, and hence global geometric rate can be established using Nesterov's analysis. Next 
we explain conditions that characterize restricted strong convexity for sparse vectors. 

2.3 Restricted eigenvalue conditions 

We first define some standard notations for sparse recovery. For a vector x G M", let 

supp(2;) = {j : Xj / 0}, ||3;||o = |supp(x)|. 

Throughout the paper, we denote supp(a;) by S, and use S^ for its complement. We use the 
notations Xg and xgc to denote the restrictions of a vector x to the coordinates indexed by S and 
S'^, respectively. 

Various conditions for sparse recovery have appeared in the literature. The most well-known of 
such conditions is the restricted isometry property (RIP) introduced in |CT05] . In this paper, we 
analyze the numerical solution of the ii-hS problem under a slight generalization, which we refer 
to as restricted eigenvalue condition. 

Definition 1. Given an integer s > 0, we say that A satisfies the restricted eigenvalue condition 
at sparsity level s if there exists positive constants p~-{A, s) and p+{A, s) such that 

( T A At 

p+(A,s) = sup < 7r : X ^ 0, llxllo < s 

[ X^ X 

( T^ A^ A^ 

/9„(A, s) = mt < 7f : X 7^ 0, ||a::||o < s 

[ x'^ X 

Note that a matrix A satisfies the original definition of restricted isometry property with RIP 
constant v at sparsity level s if and only if /5+(^, s) < 1 + i^ and /0-(^, s) > 1 — z^. More generally, 
the strong convexity of the objective function in ([1]), namely 4>\{x)^ is equivalent to p~(^, n) > 0. 
However, since we are interested in the situation of m < n, which implies that /9_(A, n) = 0, we 
know that (^\ is not strongly convex. Nevertheless, for s < tti, it is still possible that the condition 
p„(y4, s) > holds. This means that if both x and y are sparse vectors, then ^\ is strongly convex 
along the line segment that connects x and y. Moreover, the inequality that characterize the 
smoothness of the function, namely (J12p . could use a much smaller restricted Lipschitz constant 
instead of the global constant Lj = /9_|_(A, n). More precisely, we have the following lemma. 

10 



Lemma 3. Let f{x) = {l/2)\\Ax — ^IH- Suppose x and y are two sparse vectors such that 

|supp(j;) U supp(y)| < s 
for some integer s < m. Then the following two inequalities hold: 

fiy) < fix) + {Vf{x),y-x) + ^±^\\y-x\\l (16) 

/(y) > fix) + {Vfix),y-x) + P^^\\y-x\\l (17) 

Proof. For any x, y G W^, it is straightforward to verify that if f{x) = {l/2)\\Ax — 6|||, then 

f{y)-f{x)-{Vf{x),y-x) = ^\\A{y-x)\\l 

Since the assumption |supp(a;)Usupp(y)| < s imphes ||?/ — x||o < s, we use the definition of restricted 
eigenvalues to conclude 

p.iA,s)\\y-x\\l < \\A{y-x)\\l < p+{A, s)\\y - xg. 

These lead to the two desired inequalities. D 

The inequality (fT6|) represents restricted smoothness, and (fT7|) represents restricted strong con- 
vexity. A key feature of our PGH method is that sparsity along the whole solution path can be 
enforced. Therefore the objective function in ([T]) becomes strongly convex along the solution path if 
the sparse eigenvalues in Definition [1] are well behaved (i.e., they grow slowly when s is increased). 
In such a situation, the PGH method exhibits geometric convergence along the homotopy path, 
and the convergence rate depends on a restricted condition number, defined as 

.(.,.). ^. (18) 

In particular, if the matrix A has RIP constant u at sparsity level s, then n{A, s) < (1 + i^)/(l — i^). 

3 A proximal-gradient homotopy method 

The key idea of the proximal-gradient homotopy (PGH) method is to solve ([1]) with a large regular- 
ization parameter Aq first, and then gradually decreases A until the target regularization is reached. 
For each fixed A, we employ Nesterov's proximal-gradient method described in Algorithms [U and [H 
to solve problem ([1]) up to an adequate precision. Then we use this approximate solution to warm 
start the PG method for the next value of A. 

Our proposed PGH method is listed as Algorithm [3l To make the presentation more clear, we 
use Atgt to denote the target regularization parameter. The method starts with 



Ao = Wb\\ 



Ti... 

OOl 



since this is the smallest value for A such that the £i-LS problem has the trivial solution (by 
examining the optimality condition). Our method has two parameters r] G (0,1) and 5 G (0,1). 
They control the algorithm as follows: 
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Algorithm 3: £(*§*) ^ Homotopy(A, 6, Atgt, e,Lr 



input: A G M™><", h G M", Atgt > 0, e > 0, Lj^jn > 
parameters: r? G (0, 1), (5 G (0, 1) 

initialize: Aq ^ ||A'^6||oo, x(°) ^0, Mq ^ -^min 

iV^[ln(Ao/Atgt)/ln(l/r?)J 

for K = 0,1, 2,..., TV- 1 do 

Aa'+i ^ r/A/< 

CA'+l ^ (^Aa'+i 

{x(^+i),Mi^+i} ^ ProxGrad(AA+i,eA+i,£(^\Mi^) 
end 

{xfe*),Mtgt} ^ ProxGrad(Atgt,e,x(^),M^) 
return x'^*^*) 



• The sequence of values for the regularization parameter is determined as \k = n ^o for 
K = 1, 2, . . ., until the target value Atgt is reached. 

• For each Ax except Atgt, we solve problem ([1]) with a proportional precision 5Xk- For the 
last stage with Atgt, we solve to the absolute precision e. 

As discussed in the introduction, sparse recovery by solving the £i-LS problem requires two 
types of conditions: the regularization parameter A is relatively large compared with the noise 
level, and the matrix A satisfies certain RIP or restricted eigenvalue condition. It turns out that 
such conditions are also sufficient for fast convergence of our PGH method. More precisely, we have 
the following assumption: 

Assumption 1. Suppose b = Ax + z. Let S = supp(x) and s = \S\. There exist 7 > and 
6' G (0, 1) such that 7 > (1 + S')/{1 - 6') and 

Atgt > max{4, (^ _ ^.)!^t'(i + ^/) } H-^^^H- (19) 

Moreover, there exists an integer s such that p^{A,s + 2s) > and 

- ^ lQ{rmcP+{A,s + 2s) + 2p+iA,s)) ^^ 

' > p.iA,s+~s) ^' + ^)" ^2°) 

We also assume that Lmin < 7incP+(^)^ + 2s). 

According to |ZH08j . the above assumption implies that the solution a;*(A) of ([T|) is sparse 
whenever A > Atgt; more specifically, ||a;*(A)5c||o < s (here S'^ denotes the complement of the 
support set S). In this paper, we will show that by choosing the parameters rj and 5 in Algorithm [3] 
appropriately, these conditions also imply that all iterates along the solution path are sparse. Our 
proof employs a similar argument as that of [ZHOSj . Before stating the main convergence results, 
we make some further remarks on Assumption [TJ 

• The condition (J19p states that the A must be sufficiently large to dominate the noise. Such 
a condition is adequate for sparse recovery applications because recovery performance given 
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in ([2]) achieves optimal error bound under stochastic noise model by picking A of the order 
1 1 ^^-2 1 loo [C^TOTl [ZHnSl IZhan9[ iBKlTM IKolOQi IvdGBnQl IWaiOQj . Moreover, it is also necessary 
because when A is smaller than the noise level, the solution x*(A) will not be sparse anymore, 
which defeats the practical purpose of using ii regularization. 



• The existence of s satisfying the conditions (120p is necessary and standard in sparse recovery 
analysis. This is closely related to the RIP condition of |CT05j which assumes that there 
exist some s > 0, and u £ (0, 1) such that k{A, s) < (1 + i^)/(l — i^)- In fact, if RIP is satisfied 
with I' = 0.2 at s = 193(1 + 7)5, then we may take 7inc = 2 and s = 96(1 + 7)s so that the 
condition (j20p is satisfied. To see this, let s = s + 2s and note that 

1 — z> P-[A, s + s) 

Therefore we have 

s = 96(1 + 7)s = 64- (1 + 7)s > 16 --—^ — rr (1 + 7)s. 

1 — z^ p~[A, s + s) 

Although for practical purpose these constants are rather large, it is worth mentioning that 
our analysis focuses on the high level message, without paying special attention to optimizing 
the constants. 

• If Lmin > 7incP+(^,s + 2s), then we may simply replace ji^cP+iA,s + 2s) by L^am in the 
assumption, and all theorem statements hold with 7inc/0+(^, s + 2s) replaced by Lmin. Nev- 
ertheless in practice, it is natural to simply pick 

-^min = /0+(^, 1) = max \\Ai\\l, 

ie{l,...,n} 

where Ai is the i-th column of A. It automatically satisfies the condition L^i^ < p+{A, s + 2s). 

Our first result below concerns the local geometric convergence of Algorithm [21 Basically, if 
the starting point x^^' is sparse and the optimality condition is satisfied with adequate precision, 
then all iterates along the solution path are sparse, and Algorithm [2] has geometric convergence. 
To simplify the presentation, we use a single symbol k. to denote the restricted condition number 

«-K(A5 + 25J-^_(^^-^2s)- 
Theorem 1. Suppose Assumptionll\ holds. If the initial point x^^' in Algorithmic satisfies 

\\xf.\\o<~^, o.a(x(°)) < 5'A, (21) 

then for all k >0, we have 



„<., *.(xW)-« < (l-j;i-) (0A(x<»))-«), 



where cp^ = (j)x{x*{X)) = min^; (l)x{x). 
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Our next result gives the overall iteration complexity of the PGH method in Algorithm [3j 
Roughly speaking, if the parameters 5 and r] are chosen appropriately, then the total number of 
proximal-gradient steps for finding an e-optimal solution is 0(ln(l/e)). 

Theorem 2. Suppose AssumptionUlholds with Atgt < Aq and the parameters 5 and r] in Algorithmic 
are chosen such that 

Let N = [in (Ao/Atgt) /lnr/~^J as in the algorithm. Then: 

1. The condition \21\) holds for each call of Algorithmic For K = 0, . . . ,N — 1, the number of 
proximal-gradient steps in each call of Algorithmic is no more than 

1 \^^ 




In 1 



47inc«^ 



K- 



where C = 87inc(l + i^) {^ +j)ks. Note that this bound is independent of X 
2. For K = {),... ,N — \, the outer-loop iterates x^ ' satisfies 

*.„(*'^-')-*^... <.=<-'• i;51Ll_Zl|, (22) 

and the following bound on sparse recovery performance holds 



p — x\\2 < V 



p^{A,s + s) 
3. When Algorithmic terminates, the total number of proximal- gradient steps is no more than 



\ \nr] ^ \d^ J \ ^ ^^ 1 1 / V 47incK 

and the output x*-*^*) satisfies 



-vtgt \-^ ) ^-^tgt 



4(l + 7)Atgts 



p_(As + s) 



We have the following remarks regarding these results: 

• The precision e in Algorithm [3] is measured against the optimality residue lo\{x). In terms of 
the objective gap, suppose eo > is the target precision to be reached. Let 



Kn 



1^ /4.5(l + 7)Ag.-W, -: 
2 \p^{A,-s^s)e^) I 



1. 



From the inequality ([^2|) . we see that if < Kq < N — 1, then for all K > Kq, 

If we let eo — )• and run the PGH method forever, then the number of proximal-gradient 
iterations is no more than 0(ln(Ao/eo)) to achieve an eq accuracy both on the gap of objective 
value and on the optimality residue iOx{-) < eo- This means that the PGH method achieves 
a global geometric rate of convergence. 
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When the restricted condition number k is large, we can use the approximation 

1 \-^ 1 

In ( 1 ^ 



Then the overall iteration complexity can be estimated by O {k ln(Ao/e)), which is propor- 
tional to the restricted condition number k. 

• Even if we solve each stage to high precision with ex+i = niin(e, 5A/^+i), the global conver- 
gence rate is still near geometric, and the total number of proximal-gradient steps is no more 
than 0((ln(Ao/e))2). 

Theorem [2] plus restricted strong convexity immediately implies that the approximate solu- 
tions x^^' (and the last step solution x^*^*') also converge to j;*(Atgt) at a globally geometric rate. 
A particularly interesting case is noise- free compressed sensing using the BP formulation ([8]) , which 
has the optimal solution x. For this problem, we can simply run Algorithm [3] with Atgt = to 
solve dS]). While the convergence metrics such as objective value gap or optimality residue are 
no longer informative in this case. Theorem [2] implies geometric convergence of the recovery error 
\\x^^' — x\\2. More precisely, we have: 

Corollary 1. Suppose b = Ax and the assumptions stated in Theorenn\^hold. We can choose an 
arbitrarily small Atgt > in Algorithmic and after K outer iterations, we have 

\\^(K) -II ^ A'+i 2AnA/l 
\\x^ — x\\2 < r] - 



P-{A,s + s)' 

Note that part 1 of Theorem [2] implies that K outer iterations of Algorithm [3] requires no more 
than 0{K) proximal-gradient steps. This result can be interpreted as a global geometric rate of 
convergence for solving the BP problem. 

4 Proofs of convergence results 

The proofs of our convergence results are divided into the following subsections. In Section [4.11 we 
show that under Assumption [H if x^^' is sparse and uj\{x^^') is small, then all iterates generated 
by Algorithm [2] are sparse. In Section 14. 2^ we use the sparsity along the solution path and the 
restricted eigenvalue condition to show the local geometric convergence of Algorithm O thus proving 
Theorem [TJ In Section 14.31 we show that by setting the parameters 6 and rj in Algorithm [3] 
appropriately, we have geometric convergence at each stage of the homotopy method, which leads 
to the global iteration complexity 0(log(l/e)). 

4.1 Sparsity along the solution path 

First, we list some useful inequalities that are direct consequences of the assumption (fT9l) : 

{l-6')X-\\A^z\\oo > (23) 

{l + 6')X+\\A^z\\oc < 2A (24) 

A+P^z||oo < (2-(5')A (25) 

{l + d')X+\\A^z\\ 



{l-6')X-\\ATz\\o. 
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< 7- (26) 



The following result means that if x is sparse, and it satisfies an approximate optimality condi- 
tion for minimizing (j)\, then (j)\{x) is not much larger than (j)x{x). 

Lemma 4. Suppose Assumption{l\holds, and A > Atgt- If x is sparse, i.e., \\xgc\\o < s, and it 
satisfies the approximate optimality condition 

min \\A^{Ax - b) + Adl < <5'A, (27) 

then we have 

||(:r-%.||i<7||(x-%||i (28) 

and 

II -II ^ 2Av/| 

\\x-x\\2< — , _ . (29) 

p^{A,s + s) 

and 

fe(.)<fe(.)+ ^^'''/:y; . (30) 

P-{A,s + s) 

Proof. Let ^ € 9||a;||i be a subgradient that achieves the minimum on the left-hand side of ()27p . 
Then the approximate optimality condition leads to 

{x - xf {A^ {Ax - b) + X^) < ||x-x||i||^^(^x-6) + Ae||^ 

< 5'X\\x — x\\i. 

On the other hand, we can use h = Ax -|- z to obtain 

(x - xf {A'^iAx -b) + Ae) = (x - xfA^ {A{x - x) - z) + X{x - xf( 

= \\A{x-x)\\l-{x-xfA'^z + X^'^{x-x) 
> \\A{x - x)\\l - \\x - x\\i\\A^z\\c^ + A^^(x - x). 

Next, we break the inner product ^"^(x — x) into two parts as 

^^(x - x) = C|(2; - x)s + £,'§c{x - x)sc. 

For the first part, we have (by noticing ||^||oo < 1) 

Cf(x-x)5 > - ||Cslloo||(a;-x)s||i > - ||(x-x)s||i. 

For the second part, we use the facts Xgc = and ^ G c^||2;||i to obtain 

S,§cix — X)§c = Xg^^gc = ||x_5c||i = ||(x — x)5c||l. 

Combining the inequalities above gives 

||^(x — x)||2 — ||A z||oo||a^ — 2;||i — A||(x — x)^||i -|- A||(x — x)gc||i < 5'A||x — x||i. 

Using ||x — x||i = ||(x — x)g||i -|- ||(x — x)_5c||i and rearranging terms, we arrive at 

\\A{x-x)\\l+{{l-6')X-\\A^z\\^)\\{x-x)sc\\i < {{l+6')X + \\A'^z\\^)\\{x-x)s\\i. (31) 
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By further using the inequahties (p3]) and (j26j) . we obtain 

\\ix-^)sA\i < 711(2; -^)5lli' 



which is the first desired result in (|28p . 

Since by assumption ||xgc||o < s, we can use the restricted eigenvalue condition to obtain 

P-{A,s + s)\\x-x\\l < \\A{x-x)\\l 

< {{l + 6')X+\\A^z\\oo)\\{x-x)s\\i 

< 2X\\{x-x)s\\i 

< 2XV^\\{x-x)sh 

< 2A\/I||x — x\\2, 

where the second inequality is a result of (I3ip , the third inequality follows from ()24p , and the fourth 
inequality holds because \S\ = s. This proves the second desired bound in (p9|) . 

Finally, since (j)x is convex and ^^{Ax — 6) + ^ is a subgradient of (/< at x, we have 

(pxix) - (/)x{x) < - {A^ {Ax - b) + C) (x-x) < 6'X\\x-x\\i. 
From the inequality in (|28p . we have 

||x-x||i = \\{x-x)s\\i + \\{x-x)s,\\i < {l + j)\\{x-x)§\\i. 
Therefore, 

Mx)-M^) < S'X{l + -f)\\{x-x)s\\i < 5'X{l + -f)V^\\{x-x)sh, 
which, together with (129p . leads to the third desired result. D 



The following result means that if x is sparse, and (px{x) is not much larger than (f>x{x), then 
both \\x — x\\2 and \\x — x\\i are small. 

Lemma 5. Suppose AssumptionUlholds, and X > Atgt. Consider x such that 

II II ^~ A ( ^<'A ,_, , 26'{l + ^)X's 

P-[A,s + s) 

then 

\ ^ U At -Ml2 II -II 1 ^ 4(l+7)As 



2A" ' "'^' " "J - p-{A,s+~s) 

In fact, similar results holds under the condition ujx{x) < 5'X, and are already proved in 
Lemma m However, in the proximal gradient method, the optimality residue ujx{x^ ') may not 
be monotonic decreasing, but the objective function (j)x{x^^') is. So in order to establish the desired 
results for all iterates along the solution path, we need to show them when the objective function 
is sufficiently small, which is more involved. 
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Proof. For notational convenience, let 

P-{A,s + s) ■ 
We write the assumption (j)x{x) < 4>\{x) + A explicitly as 



-\\Ax - b\\l + A||x||i < -\\Ax - 6||^ + A||x||i + A. (32) 



We can expand the least-squares part in 4>\{x) as 

\Ux-h\\l = l\\{Ax - b) + A{x - x)§ 



= l\Ux -b)\\l + ^\\A{x - x)||i + (x - xfA^iAx - b) 
> ^ll(^^ -ml + l\\A{x-x)g-\\x- xh\\A^{Ax - b 



Plugging the above inequality into (j32p . and noticing Ax — b = z, we obtain 

-P(x-x)||| - ||3;-x||iP^z||oo + A||a;||i < A||x||i + A. 
Using the fact Xgc = 0, we have 

•^1 ~~ \\Xgc\\'\ ~r Xci ^ Xcc X5ci ~r X51. 

M 11-^ II Oll-^ II tJll-^ II J O II ^ II Oll-^ 

Therefore 

-\\A{x - x)\\l - \\x - x\\i\\A^ z\\oo + Allx^c -XgcWi < Adlx^lli - Hx^Hi) + A 

< A 11x5 — xg\\i + A. 

By further splitting ||x — x||i on the left-hand side as ||(x — x)g||i + ||(x — x)_5c||i, we get 

-\\A{x-x)\\l + {X-\\A^z\\^)\\{x-x)s4i < (A + P^2||oo)||(x-x)s||i + A. (33) 

Now there are two possible cases. In the first case, we assume 

A 2(l + 7)As 

X-X 1 < —- = , _ y (34) 

o'A p-[A,s + s) 

From (I23p . we know that (A — 11^4 z||oo) ||(x — x)5c||i is nonnegative, so we can drop it from the 
left-hand side of (j33p to obtain 

ip(x-x)||i < {X+\\A^z\\^)\\{x-x)s\\i + A 
< (2A-yA)||(x-x)5||i + A 

>-(A,s + s) p-{A,s + s) 
4A(l + 7)Ag 
P-{A,s + s)' 

18 



where in the second inequahty we used (j25p . and in the third inequahty we used (j34p . This means 
the claim holds. 

In the second case, the assumption in (j34p does not hold. Then A < (^'A||a; — x||i and (|33p 
implies 

-\\A{x - x)\\l + (A - P^^lloo) 11(2; - x)5c||i < (A + \\A^z\\oo) \\{x - x)s\\i + 6'X\\x - x\\i. 
Again we split ||x — x\\i as \\{x — x)g\\i + ||(x — x)_§c||i to obtain 
^\\A{x-x)\\l + {{l-6')X-\\A'^z\\^)\\{x-x)sc\\i < {il + 6')X + \\A'^z\\oo)\\{x-x)s\\i. (35) 



By further using the inequalities (p3|) and (j26j) . we get 

\\(^-^)s4i < ji!^!|^!pT^||^ ll(^-^)5lli < 7ll(^-%lli. (36) 

Moreover, we can use the restricted eigenvalue condition and the assumption ||rE^c||o < s to obtain 
-p_{A,s + s)\\x-x\\l < -\\Aix-x)\\l 

< {{l + 6')X + Wz\\oo)\\{x-x)s\\i 

< 2A||(x-x)5||i 

< 2Xy/l\\{x - x)§\\2 

< 2Avi \\x — x\\2, 

where the second inequality follows from ()35p . the third inequality follows from ()24p . and the forth 
inequality holds because \S\ = s. Hence 

II -II ^ 4A^/I 

P-{A,s + s) 

The above arguments also imply 

where the last inequality holds because 7 > 1. Finally, using ([36]) . we get 

4(l + 7)As 



\x-x\\i < {l + 7)\\ix-x)s\\i < (l+7)Vi||(x-x)s||2 < 



P-{A,s + s) 
These prove the desired bound. D 

The following lemma means that if x is sparse and (px^x) is not much larger than (j)x{x), then 
T\l[x) is sparse. 

Lemma 6. Suppose Assumption{l\holds, and X > Atgt- Suppose x satisfies 

\\xsA\o<s, Mx)<Mx) + ^^^^j^^^, (37) 

p^[A,s + s) 

and L < 7incP+(^, s + 2s). Then 

||(7a,l(x))^4o < ^■ 
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Proof. Recall that Tx^l can be computed by the soft-thresholding operator as in (l6|). That is, 
{TL{x))i = sgn{xi) max < \xi\ " T' r ' i = l,...,n, 

where 

X = X- yA^{Ax -b) = X- —A^A{x - x) + —A^z. 

In order to upper bound the number of nonzero elements in (Ti(x))_§c, we split the truncation 
threshold A/L on elements of x^c into three parts: 

• A/4L on elements of x^c , 

• A/4L on elements of (\IV)A^z^ and 

• A/2L on elements of {l/L)A'^A{x — x). 

Since by assumption ||A-^z||oo < A/4, we have \{j : {{1/L)A'^ z)j > A/4L}| = 0. Therefore, 

\\{Tl{x))^X ^ |{iGS=:k,|>A/4L}| + |{j:|(A^A(x-x))J>A/2}|. 
Note that 

\{j E S" : \xj\ > A/4L}| = \{j eS" -.lix- x)j\ > A/4L}| 

< |{j:|(x-5;),|>A/4L}| 

< 4LA"-^||x -x||i 

< ^5^^^, (38) 

p.{A,s + s)' 

where the last inequality follows from Lemma [3 

For the last part, consider S' with maximum size s' = |5'| < s such that 

S'c{j:\iA^Aix-x)),\>X/2}. 

Then there exists u such that ||u||oo = 1 and ||ti||o = s', and s'A/2 < u'^A'^A{x — x). Moreover, 



s'A/2 < u^A^A{x-x) < \\Au\\2\\A{x - x)\\2 < ^/p+{A,s')^. 



P-{A,s + s)' 



where the last inequality again follows from Lemma [5l Taking squares of both sides of the above 
inequality gives 

,^ 32p+iA,s'){l + ^)s ^ 32p+iA,s)il + -f)s ^, 

^ ^ n~^ — ^^ ^ n~^ — ^^ — < •5; 

p^{A,s + s) p^{A,s + s) 

where the last inequality is due to (j20p . Since s' = \S'\ achieves the maximum possible value such 
that s' < s for any subset S" of {j : ^A^ A{x^^> — x))j\ > A/2}, and the above inequality shows 
that s' < s, we must have 

5' = {i:|(A^^(xW-x)),|>A/2}, 
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and thus 

\{j : |(A-A(.*) - .)),! > A/2}| = ,' < ='2..(A.-)(l+7).^ 

L P-{^i s -\- s) 

Finally, combining the above bound with the bound in (j38p gives 

l6 {L + 2p+{A, s)) 
p.{A,s + s) 






Under the assumption L < ^i^cP+i^i s + 2s) and ([2U|) . the right-hand side of the above inequality 
is less than s. This proves the desired result. D 

Recall that each iteration of Algorithm [2] takes the form x^ ~^^> = Tx^m,,{x^ )■ According to (J13p . 
the objective value (pxix^'^') is monotone decreasing. So if x^^' satisfies the condition (|37|) . every 
iterate x^ ' satisfies the same condition. In order to show 

\\{x^''^)s4o<s, Vfc>0, 

we only need to note that the line-search procedure (Algorithm [1]) always terminates with 

Mk<-fin,p+{A,s + 2s). (39) 

Indeed, as long as 

Mk G [p+{A,s + 2s),-fir,cP+{A,s + 2s)], 

Lemma [6] implies that || (r\,L(x)) ej,|| < s and the restricted smoothness property p^ implies the 
termination of line-search. 

4.2 Proof of Theorem [1] 

In this subsection, we show that for any fixed A, the sequence {x''^'},_„ generated by Algorithm [2] 
(without invoking the stopping criteria) has a limit and the local rate of convergence is geometric. 
First, since the sub-level set {x : (t)\{x) < (t)\{x^^>)} is bounded and (f)x{x^^') is monotone 
decreasing, the sequence {x'^^},_„ is bounded. By the Bolzano- Weierstrass theorem, it has a con- 
vergent subsequence and a corresponding accumulation point. Moreover, from the inequality (J13p 
and the fact that (px{x) is bounded below, we conclude that 

lim ||5a,l(x('=))||2 = 0. 

By Lemma [21 this implies that any accumulation point of the sequence {x^^},_„ satisfies the 
optimality condition, therefore is a minimizer of (px- 

Let x*{X) denote an accumulation point of the sequence {2; };j^q- As a consequence of 
Lemma[6l any accumulation point is also sparse; In particular, we have ||(2;*(A))5c||o < s. 

Now using the restricted strong convexity property (I17p . we have 



fix) > fix*) + (V/(x*(A)),x - x^iX)) + EA^2l±lfl\\x - x*(X)\\l (40) 

Since x*iX) = argmin^.{/(x) -|- A||x||i}, there must exists ^ G 9||a;*(A)||i such that 

V/(x*(A)) + Ae = 0. (41) 
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Since S, G 9||x*(A)||i, we also have 

A||x||i > A||x*(A)||i + (A^x - x*{X)). (42) 

Adding the two inequalities (j40]) and (j42l) and using (f4T|) . we get 



M^) - Mx^'iX)) > P^^^^±^\\x - x^X)\\l Vx : llxg.llo < ~s. (43) 

Since any accumulation point satisfies Hx^cHo < s, we conclude that x*{X) is a unique accumu- 
lation point, in other words, the limit, of the sequence {x^'''\,_„. 

Next we show that under the assumptions in Lemma [6l especially with x^'^' satisfying (j37p . 
Algorithm [2] has a geometric convergence rate. We start with the stopping criteria in the line 
search procedure: 

<Aa(^('+')) < V'A,A4(x('=\^('+')) 

< mm |/(x) + ^\\x- x^^^ll + A||x||i| 

= min|c/>,(x) + ^|k-xW||i 

where the second inequality follows from the convexity of /. We can further relax the right-hand side 
of the above inequality by restricting the minimization over the line segment x = ax*{X)+{l—a)x^''', 
where a £ [0, 1]. This leads to 



</>A(a;('=+i)) < mmUx(ax*{X) + il-a)x^^^) + —\\a{ 



xW-x*(A))|| 



< min <J aM^^W) + (1 " a)^^'-''^) + ^||xW - x*iX)\\l 



mm < cAa(x^')) - aiMx^"^) - M^*W)) + ^^\\x^'^ - x'{X)\\l 



Since the conclusion of Lemma [6] implies that Hx^,, ||o < s for all A; > 0, we can use the "restricted" 
strong convexity property (|43p to obtain 

</>,(x(^+i)) < mm |0,(xW) - a (^1 - -^||^) (</>,(:rW) - Mx'{X))) 

The minimizing value is a = P-{A, s + 2s)/(2Mfc), which gives 

^a(x(^-^^)) < ^a(x(^-)) - ^-^^^'^+ ''^ (0a(x(^)) - ^a(x-(A))) . 

Let (p*^ = (j)xix*{X)). Subtracting (j)\ from both side of the above inequality gives 

.*a(x"=+'>) - « < (1- "^%""''' ) (■»>(-"") -■»•>) 

s ri- /-''";;tf,., )(fe(.W)-0;), 
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where the second inequality follows from (j39p . Therefore, we have 
where 

K 






is a restricted condition number. Note that the above convergence rate does not depend on A. 

4.3 Proof of Theorem [5] 

In Algorithm [3l x^^' denotes an approximate solution for minimizing the function ^a^-- ^ ^^y i*^^^ 
of the homotopy method is to use x^ ' as the starting point in the proximal gradient method for 
minimizing the next function (px^^^-^ ■ The following lemma shows that if we choose the parame- 
ters S and r] appropriately, then x^^' satisfies the approximate optimality condition for A^'+i that 
guarantees local geometric convergence. 

Lemma 7. Suppose x'^^-* satisfies the approximate optimality condition 

ojx^ix^''^) < SXk 

for some 6 < 6' . Let Xr+i = ijXk for some rj that satisfies 



1 + 5 



1 + 5' 



7 <?7< 1. 



(44) 



Then we have 



0Jx^^^{x^'^^) < 5' \k+i. 



Proof. If WA;^ (xW) < 5\k, then there exists ^ E d\\x^^'^\\i such that ||V/(xW) + \kS,\\^ < 5Xk- 
Then we have 

WA,+i (£('')) < V/(xW) + Aa'+iC 

oo 

= V/(x(^)) + Xk^ + (Ak+1 - A,^)^ 

oo 

< V/(£(^)) + Aa'^ +|a^+i-Ax|-||?||oo 

oo 

< 5XK + {l-r])XK. 

Since the condition (H^ implies 5Xk + (1 — v)''^K < 5'Xk+i, we have the desired result. D 

Lemma 8. Assume that for some x and X > Atgt, 

^\{x) < 5'X. 
Then for all X' G [Atgt, X], we have 

2{l + -f){X + X'){ujx{x) + X-X')s 



4>y{x)-<Px'{x*{X'))< 



p.{A,s + s) 
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Proof. Let ^(A) = argmin^ggn^n^ ||V/(x) + A^H^. Thus u^ix) = ||V/(x) + Ae(A)||^. By the con- 
vexity of (py , we have 

^x'ix)-cPy{x*{X')) < {Vf{x) + X'aX),x-x*{X')) 

< ilMix) + Ae(A)|U + A - A')||x - x*(A')||i 

= {ux{x) + X-X')\\x-x*{X')\U. (45) 

By Lemma m we have 

||S-x'^(A')||i < (l + 7)^^||x-x*(A')||2 < ^^^^^ 

P-{^, s -\- S) 

and 

II- II ^n^ ^ /-ii- II ^ 2(l + 7)As 
If — 2;||i < (1+7JVs|f — x||2 < 



Therefore, we have 



\x - x*(A')||i < \\x - x\\i + \\x - 3;*(A')||i < 



,,,.„ ^2(1 + 7)(A + A')s- 



p_iA,s + s) 
Now we obtain from (1451) that 



<Pyix)-<Px'{x*{X')) < 



,,,., ^ 2(l + 7)(A + A')(a;A(x) + A-A')s 



P-{A,s + s) 
This proves the desired result. D 

Now we are ready to give an estimate of the overah complexity of the homotopy method. First, 
we need to bound the number of iterations within each call of Algorithm [2j 

Using Lemma O we can upper bound the measure for approximate optimality as 



-Xi-''^'') < (^ + ^^^]hx,M,{x^'')\ 



A4 / M-->-fcv- ^112 

^ / p+{A,s + 2s)\ II , ffcN.M 

where the second inequality follows from 

Sm,{x^'''^) < P+{A, s + 2s), Mk > P-{A, s + 2s), 

which are direct consequences of the line-search termination criterion, the restricted smoothness 
property (J16p and the restricted strong convexity property (J17p . 

In order to bound the norm of gx^M^.i^ )■, we use the inequality ()13p and Theorem [1] to obtain 

||<7A,A/,(x('=^)||2 < 2Mfc(</.,(a;('=))-<A;,(x(^-+i))) 

< 2il/4(</.A(x«)-</'^) 

< 27incP+(As + 25)(^l-^-i-) (^0;^(a;(o))-,/,^), 
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where (j)*^ = <j)\{x*{\)) = niin^,. </>A(a^)- Therefore, in order to satisfy the stopping criteria 
it suffices to ensure 



which requires 

We still need to bound the gap (l)\{x^^>) — (f)^. Since Lemma [7] implies that lox{x^^') < 6' \, we 
can obtain directly from Lemma [8] the following inequality by setting A' = A and x = x^^^': 

Therefore, the number of iterations in each call of Algorithm [2] is no more than 
^^( 8r.nc{l + K)H^ + l)sp+{A,s + 2s) \ l(^ 1 ^-^ 



To simplify presentation, we note that 

C = 87inc(l + K)^(l+7)s'^ > 87inc(l + K)^(l+7)^"' 



p.{A,s+~s) 
Thus the previous iteration bound is no more than 

In ( £ 1 / In I 1 



S"^ J / V 47incK 

This proves Part 1 of Theorem [2j We note that this bound is independent of A. 

In the homotopy method (Algorithm [3]) , after K outer iterations for K < N — 1, we have from 
Lemma[7]that uj\j^^^{x^^') < 6'Xk+i- The sparse recovery performance bound 



^.{K) 



X^ ' — X 



\2<2if+^\QV"s/p^{A,s+~s) 



follows directly from Lemma U] and Xk+i = V^^^Xq. Moreover, from Lemma [8] with A' = Atgt, 
A = \k+i, and x = x^^\ we obtain 

S, ix^^))-6\ ^ 4.5(1 +7)A^^i.- o(;,+,) 4.5(1 +7)Agg- 
This proves Part 2 of Theorem [2l 
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In Algorithm [3l the number of outer iterations, excluding the last one for Atgt, is 

ln(Ao/Atgt) 



N 



ln(l/r/) 



The last iteration for Atgt uses an absolute precision e instead of the relative precision (5 Atgt • There- 
fore, the overall complexity is bounded by 



ln(l/?7) \5^J \ ' ^^ / / / V 47incK 



Finally, when the PGH method terminates, we have u}\^ ^(a;'*^*') < e. Therefore we can apply 
Lemma [8] with A = A' = Atgt and x = x^^^^' to obtain the last desired bound in Part 3. 

5 Numerical experiments 

In this section, we present numerical experiments to supports our theoretical analysis. First, we 
illustrate the numerical properties of the PGH method by comparing it with several other methods. 
More specifically, we implemented the following methods for solving the ^i-LS problem: 

• PG: Nesterov's proximal gradient method with adaptive line search (Algorithm [2]) . 

• PGH: our proposed PGH method described in Algorithm [3j 

• ADG: Nesterov's accelerated dual gradient method, i.e.. Algorithm (4.9) in |Nes07j . 

• ADGH: the PGH method in Algorithm El but with PG replaced by ADG. 

We generated a random instance of ([T]) with dimensions m = 1000 and n = 5000. The entries of 
the matrix A S ]g"^x" are generated independently with the uniform distribution over the interval 
[—1,-1-1]. The vector x £ M" was generated with the same distribution at 100 randomly chosen 
coordinates (i.e., s = |supp(x)| = 100). The noise z G M™' is a dense vector with independent ran- 
dom entries with the uniform distribution over the interval [— cj, a], where a is the noise magnitude. 
Finally the vector b was obtained as 6 = Ax + z. In our first experiment, we set a = 0.01 and 
choose Atgt = 1- For this particular instance we have roughly ||A-^2;||oo0.411. To start the PGH 
method, we have Aq = ||A 6||oo = 483.4. 

Figure [1] illustrates various numerical properties of the four different methods for solving this 
random instance. We used the parameters 7inc = 2 and 7dec = 2 in all four methods. For the 
two homotopy methods (whose acronyms end with the letter H), we used the parameters rj = 0.7 



and 6 = 0.2. In the first four subfigures (a) (d) , the horizontal axes show the cumulative count 



of inner iterations (total number of proximal-gradient steps). For the two homotopy methods, the 



vertical line segments in the subfigures (a) , (c) and (f ) indicate switchings of homotopy stages 



(when the value of A is reduced by the factor rj) — they reflect the change of objective function or 
the optimality residue for the same vector x^'''. 

Figure [ ^a)] shows the objective gap 4>\{x^'^') — 4>x^ ^ versus the total number of iterations k. The 
PG method solves the problem with the target regularization parameter Atgt directly. For the first 
350 or so iterations, it demonstrated a slow sublinear convergence rate (theoretically 0{l/k)), but 
converged rapidly for the last 30 iterations with a linear rate. Referring to Figure [WbU we see that 
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(c) Optimality residues. 
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Figure 1: Solving a random instance of the £i-LS problem. Problem sizes: m = 1000, n = 5000, 
s = 100, and Atgt = 1- Entries of A G ]g™-x" were generated with independent uniform distributions 
over [— 1,+1], and H^Hoo = 0.01. Algorithmic parameters: 7inc = 2, 7dec = 2, 77 = 0.7, and 5 = 0.2. 
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the slow convergence phase of PG is associated with relatively dense iterates (with Hx^'^^Ho ranging 
from 5,000 to several hundreds), while the fast linear convergence in the end coincides with sparse 
iterates with Hx'^^'Ho around one hundred. In contrast, the PGH method maintains sparse iterates 
(always less than 300) along the whole solution path, and demonstrates geometric convergence at 
each stage of homotopy continuation. 

Figure [Wc)] shows the optimality residues of different methods versus the number of iterations k. 
They demonstrate similar trends as the objective function gap, but clearly they oscillate along the 
solution path and do not decrease monotonically Figure [ ^d)| plots the local Lipschitz constants 
returned by the line search procedure at each iteration. We see that the adaptive line-search method 
settles with much smaller M^ when the iterates are sparse. There is a striking similarity between 
the final stages of the PG method and the PGH method. However, the PGH method avoids the 
slow sublinear convergence by maintaining sparse iterates along its whole solution path. 

Also plotted in Figure [Dare numerical characteristics of the ADG and ADGH methods. We see 
that the ADG method is much faster than the PG method in the early phase, which can be explained 
by its better convergence rate, i.e., 0{l/k'^) instead of 0{l/k) for PG. However, it stays with the 
sublinear rate even when the iterates x^'^' becomes very sparse. The reason is that ADG cannot 
automatically exploit the local strong convexity as PG does, so it eventually lagged behind when 
the iterates became very sparse (see discussions in |Nes07j ) . In the method ADGH, we combine the 
homotopy continuation strategy with the ADG method. It improves a lot compared with ADG, 
but still does not have linear convergence and thus is much slower than the PGH method. 

Figure [We)] shows the number of proximal-gradient steps performed at each homotopy stage 
(corresponding to each A;^) of the two homotopy methods. We see that the final stage of the PGH 
method took 19 inner iterations to reach the absolute precision e = 10~^, and all earlier stages 
took only 1 to 4 inner iterations to reach the relative precision 5Xk- We note that the number of 
inner iterations at each intermediate stage stayed relatively constant, even though the tolerance 
for the optimality residue decreases as 6Xk = r]^6Xo. This is predicted by Part 1 of Theorem [2l 
The ADGH method, which employs the ADG method for solving each stage, took more number of 
inner iterations at each stage. This again reflects its lack of capability of exploiting the restricted 
strong convexity. 

The number of inner iterations is not the whole story for evaluating the performance of the 
algorithms. Figure Wf)] shows the objective gap versus the total number of matrix- vector multi- 
plications with either A or A . Evaluating the objective function f{x^') costs one matrix-vector 
multiplication, and evaluating the gradient \/f{x^') costs an additional multiplication. The esti- 
mate in (jlSp states that each proximal-gradient step in the PG method needs on average two calls 
of the oracle. But one of them is done in the line search procedure, and it requires only the function 
value. Therefore each inner iteration on average costs roughly three matrix-vector multiplications. 
On the other hand, each iteration of the ADG method on average costs eight matrix- vector multipli- 
cations jNes07j . These factors are confirmed by comparing the horizontal scales of the Figures [Wa)] 
and[Wf)| We found that the number of matrix- vector multiplications is a very precise indicator for 
the running time of each algorithm. From this perspective, the advantage of the PGH method is 
more pronounced. 

Next we conducted experiments to test the sensitivity of the PGH method with respect to the 
choices of parameters 5 and ij. Figure [2] shows the objective gap and sparsity of the iterates along 
the solution path for different 5 while keeping rj = 0.7. We see that when 6 is reduced from 0.2 
to 0.1, the iterates became slightly more sparse, hence the convergence rate at each stage can be 
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Figure 2: Performance of the PGH method by varying b while keeping r] = 0.7. 
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Figure 3: Performance of the PGH method by varying rj while keeping 5 = 0.2. 

slightly faster due to better conditioning. However, this was countered by more iterations at each 
stage required by reaching more stringent precision, and the overall number of proximal-gradient 
steps increased. On the other hand, increasing 6 to 0.8 made the intermediate stages faster by 
requiring loose precision. However, this comes at the cost of less sparse iterates, and the final stage 
suffers a slow sublinear convergence in the beginning. 

Figure [3] shows the numerical behaviors of the PGH method by varying 77 while keeping 5 = 0.2. 
We see relatively big variations of the sparsity of the iterates, but these did not affect much of 
the overall iteration count. The intermediate stages may suffer from slow convergence with less 
sparsity, but they only need to be solved to a very rough precision. It is more important to start the 
last stage with a sparse vector and enjoy the fast convergence to the final precision. It is interesting 
to note that the sufficient conditions (1 + 6) /{I + 5') < i] < 1 (in Theorem [2|) and < 5 < 5' < 1 
implies ry > 0.5. But we see that a more aggressive 77 = 0.2 still works well for this instance. 
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Figure 4: Comparison with SpaRSA and FPC. 
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5.1 Comparison with SpaRSA and FPC 

As mentioned in the introduction, similar approximate homotopy/continuation methods have been 
studied for the ii-LS problem. Here we compare the PGH method with two most relevant ones: 
sparse reconstruction by separable approximation (SpaRSA) |WNF09] . and fixed point continuation 
(FPC) |HYZ08| . In particular, the same proximal gradient method ([5]) is used in each iteration of 
both SpaRSA and FPC. Their continuation strategies are both based on reducing A by a constant 
factor at each stage. 

SpaRSA uses Barzilai-Borwein (spectral) method for choosing L^ at each step. More specifically, 
at each iteration the parameter L^ is initialized as 



L, 



\A{ 



X 



(k) 



„(fc-i) 



Jk) 



Jk-i)\ 



then it is increased by a constant factor until an acceptance criterion is satisfied. When both x^^' 
and x^ ' are sparse, say |supp(x(^) U supp(x( ')| < s for some integer s, then the above L^ 
satisfies 

P-{A,s) <Lk <p+{A,s). 



According to Section [2]3l such a line search method is able to exploit the restricted strong convexity, 
similar as the PGH method. However, the line-search acceptance criterion of SpaRSA is different 
from PGH, and they also have different stopping criteria for each homotopy stage. Global geometric 
convergence of either SpaRSA or FPC has not been established. 

In our numerical experiments, we used the monotone version of SpaRSA with continuation, 
which we call SpaRSA-MC. For FPC, we used a more recent implementation by the authors of 
[HYZ08J that also employs Barzilai-Borwein line search, which is called FPC-BB. In fact FPC-BB 
solves the equivalent problem 



minimize 

X 



f 



|x||i + ^Px-6||^ 
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Figure 5: Comparison of different methods for solving a non-sparse random instance. 

where /i = 1/A. Moreover, it further scales the matrix A so that the maximum singular value is at 
most 1. In Figures m and El the results of FPC-BB are plotted after we reversed the scalings in order 
to compare with other methods. Default options were used in both methods. SpaRSA-MC reduces 
the value of A roughly with an factor rj = 0.2, and FPC-BB has an equivalent factor r/ = 0.25. For 
meaningful comparison, we also present the results for PGH with rj = 0.2, in addition to its default 
value 1'] = 0.7. The same relative precision 5 = 0.2 was used in both cases for PGH. 

Figured] shows the numerical results of different algorithms on the same random instance studied 
in Figured! They demonstrate similar numerical properties, and SpaRSA-MC is especially similar 
to PGH with rj = 0.2. The numbers of iterations at each continuation stage depend on the specific 
stopping criteria used in different algorithms. In Figure IWbU the small number of iterations in the 
final stage of FPC-BB is a result of the relatively loose precision specified in its default options, 
which is also reflected in Figure [WaH According to Figure [ ^b)| the aggressive decreasing factors r/ 
used in SpaRSA and FPC can lead to less sparse iterates along the solution path, thus relatively 
slower convergence at the intermediate stages. But their overall iteration counts are comparable to 
PGH with J] = 0.7. 

We also conducted experiments with random problem instances where the vector x is not 
sufficiently sparse. Figure [5] shows the objective gap of different methods when solving a random 
problem instance generated similarly as the one studied in Figure [TJ The only difference is that 
here the vector x has 500 nonzero elements. In this case, all methods demonstrate sub-linear 
convergence. SpaRSA-M is the monotone version of SpaRSA without continuation. FPC-BB 
terminated prematurely because its default accuracy for its stopping criterion is too low. FPC- 
BB-HA is the result after we set a much higher accuracy in calling the FPC-BB method. It looks 
that the same higher accuracy is used in all the homotopy stages, so the number of inner iterations 
increased for each stage. We see that the algorithms with homotopy continuation still perform 
better than their single-stage counterparts, but the improvements are less impressive. Instead, the 
accelerated gradient methods ADG and ADGH outperform other methods by a big margin. 
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5.2 Basis pursuit 

Finally we present an experiment of solving the basis pursuit (BP) problem ([8]) using PGH, and 
compare it with FPC and SpaRSA. In this experiment, the matrix ^ is a partial FFT matrix. More 
specifically, we choose m = 10, 000 rows at random from the nxn FFT matrix with n = 2^^ = 65536. 
The vector x G M" has nonzero entries at only s = 1000 randomly chosen coordinates, and they 
were generated independently from the normal distribution with zero mean and unit variance. Then 
we set b = Ax in the BP problem (i.e., this is the noise- free case with z = 0). 

In this case, since ^ is a matrix with complex numbers, we need to replace all the real transpose 
in the algorithms with Hermitian transpose, and replace the soft-thresholding operator in ([7]) with 



soft(2;j, a) 



max{|xj| — a,0} 
max{|xj| — a, 0} -|- a' 
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where \xi\ denotes the modulus of the complex number Xi |WNF09] . 

The solution to the BP problem ([8]) can be obtained by letting A — )• in the ii-LS problem ([1]). 
In order to use the PGH method, we set Atgt = 10"'^'^. The same parameter was also used in calling 
SpaRSA-MC and FPC-BB. Figure [6] shows the numerical results. Again we observe remarkable 
resemblance between these methods in Figure [Ma)| However, in Figure [MbU we see the recovery 
error of SpaRSA-MC stayed at the level 10~^ while its objective function in Figure [Ma)] converged 
to zero faster than other methods. The reason is that SpaRSA has a fixed accuracy requirement 
for all continuation stages except for the last one. As shown in Figure [McH when X^ becomes very 
small, this constant accuracy is always reached within one iteration, and such a low accuracy is too 
loose for the algorithm to track the homotopy path closely. Therefore, even though the objective 
function converges to zero quickly, the recovery error stayed large. This is also confirmed through 
the denser continuation stages in the second half of SpaRSA-MC, as shown in Figure [OTc)] To see 
this, we note that the adaptive continuation used in SpaRSA is 



Xk+1 = max|r?P^(^x(^) - 6)||oo, Atgt} 



If x^ ' is an accurate solution for the stage Xk, then we have 11^4 (Ax*-^ — 6)||oo ~ ^K and thus 
Xk+1 ~ V^K with rj = 0.2 as the default value. When this is not the case, then ||A'^(j4x(^) — 6)||oo 
can be notably larger than A;^, and thus the regularization parameter reduces at a much slower 
pace. Similar as PGH, FPC-BB sets the accuracy for each continuation stage to be proportional 
to the regularization parameter, but for a different stopping criterion. With our choice of stopping 
criterion, ujx{x^ ') < 5Xk, the number of inner iterations for each continuation stage stayed roughly 
constant along the homotopy path. 

This example also demonstrates the advantage of PGH and other approximate homotopy con- 
tinuation methods over the exact homotopy path- following methods [OPTOOa] lOPTOOb] IEHJT04] . 
Figure [Mb)] shows that high-precision recovery can be obtained by PGH in less than 150 iterations 
(which corresponds to roughly 450 matrix- vector multiplications). This is much more efficient than 
using the exact homotopy path-following methods, which need to track at least 1000 breakpoints. 
In addition, their computational cost at each break point is much higher than a matrix-vector 
multiplication. 

6 Conclusion and discussions 

This paper studied a proximal-gradient homotopy method for solving the ^i-regularized least 
squares problems, focusing on its important application in sparse recovery. For such applications, 
the objective function is not strongly convex; hence the standard single-stage proximal gradient 
methods can only obtain relatively slow convergence rate. However, we have shown that under suit- 
able conditions for sparse recovery, all iterates of the proximal-gradient homotopy method along 
the solution path are sparse. With this extra sparsity structure, the objective function becomes 
effectively strongly convex along the solution path, and thus a geometric rate of convergence can be 
achieved using the homotopy approach. Our theoretical analysis are supported by several numerical 
experiments. 

We commented in the numerical experiments that accelerated gradient methods cannot auto- 
matically exploit restricted strong convexity. As discussed in |Nes041 Section 2.2] and jNes07j . they 
need to explicitly use the strong convexity parameter, or a non-trivial lower bound of it, to obtain 
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geometric convergence. In order to exploit restricted strong convexity in the ^i-LS problem with 
m < n, accelerated gradient methods need an extra facility to come up with an explicit estimate of 
the restricted convexity parameter on the fly. Nesterov gave some suggestions along this direction 
in |Nes07j . and strategies such as periodic restart have been studied recently |GLW091 IBCGllj . 
However, an in-depth investigation on this matter is beyond the scope of this paper. 
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